Linguistic Structured Sparsity in Text Categorization

Authors

  • Dani Yogatama
  • Noah A. Smith
Abstract

We introduce three linguistically motivated structured regularizers for text categorization, based on parse trees, topics, and hierarchical word clusters. These regularizers impose linguistic bias on feature weights, enabling us to incorporate prior knowledge into conventional bag-of-words models. We show that our structured regularizers consistently improve classification accuracy over standard regularizers that penalize features in isolation (such as lasso, ridge, and elastic net regularizers) on a range of datasets for various text prediction problems: topic classification, sentiment analysis, and forecasting.
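
To illustrate the contrast the abstract draws between penalizing features in isolation and penalizing them in groups, here is a minimal sketch. It assumes a group-lasso-style penalty (a sum of Euclidean norms over groups of weights), which is one common form of structured sparsity; the abstract itself does not state the exact form of the three regularizers. The function names, weights, and groups below are hypothetical and chosen only for illustration.

```python
import math


def lasso_penalty(weights, lam=1.0):
    """Standard lasso: each feature weight is penalized in isolation."""
    return lam * sum(abs(w) for w in weights)


def group_lasso_penalty(weights, groups, lam=1.0):
    """Group-lasso-style penalty: sum of the Euclidean norms of weight groups.

    `groups` is a list of index lists (assumed form, not taken from the paper).
    A group contributes zero only when all of its weights are zero, which
    encourages whole groups to be switched off together.
    """
    total = 0.0
    for idx in groups:
        total += math.sqrt(sum(weights[i] ** 2 for i in idx))
    return lam * total


# Hypothetical bag-of-words weights for five vocabulary items.
weights = [3.0, 4.0, 0.0, 0.0, 1.0]
# Hypothetical groups, e.g. words that share a topic or a word cluster.
groups = [[0, 1], [2, 3], [4]]

l1 = lasso_penalty(weights)
gl = group_lasso_penalty(weights, groups)
print(l1, gl)
```

In this sketch the grouping is where prior knowledge would enter: linguistic structure such as a parse tree, a topic, or a word cluster would decide which indices belong together, while the lasso has no notion of which features are related.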


Similar resources

Discovering Sociolinguistic Associations with Structured Sparsity

We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors' geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite ℓ1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients t...


Sparse Models of Natural Language Text

In statistical text analysis, many learning problems can be formulated as a minimization of a sum of a loss function and a regularization function for a vector of parameters (feature coefficients). The loss function drives the model to learn generalizable patterns from the training data, whereas the regularizer plays two important roles: to prevent the models from capturing idiosyncrasies of th...


A Comparative Study in Relation to the Translation of the Linguistic Humor

Mark Twain made use of repetition and parallelism as two comedic literary devices to create a comic effect for readers. As linguistic devices of humor, repetition and parallelism seemed to create many difficulties in the translation of literary texts. The present study applied Delabatista's strategies for translating wordplays such as repetition and parallelism in the translation of humorous texts...


Large-Scale Linguistic Ontology as a Basis for Text Categorization of Legislative Documents

The paper describes the structure and properties of a large linguistic ontology, a new kind of information retrieval thesaurus: the Thesaurus on Sociopolitical Life for Conceptual Indexing. The thesaurus is used in various real-scale information-retrieval applications in the legal domain. At present, one of the main applications of the Thesaurus is knowledge-based text categorization. Categories are ...


Evaluating the use of linguistic information in the pre-processing phase of Text Mining

This work proposes and evaluates the use of linguistic information in the pre-processing phase for text mining tasks applied to Portuguese texts. We present several experiments comparing our proposal to the usual techniques applied in the field. The results show that the use of linguistic information in the pre-processing phase brings some improvements for both text categorization and clustering.




Publication date: 2014